A Comparison of Three Document Clustering Algorithms: TreeCluster, Word Intersection GQF, and Word Intersection Hierarchical Agglomerative Clustering
نویسنده
چکیده
This work investigated three techniques to automatically cluster a collection of documents: Word-Intersection with GQF, Word-Intersection with hierarchical agglomerative clustering, and TreeClustering. The Word-Intersection algorithms have been previously described in the literature while the TreeClustering technique is novel to this work. The TreeCluster algorithm idea comes from rule induction techniques and is used to generate a shallow tree of clusters that a user can browse. This algorithm is also O(n) when used with a fixed tree depth, as opposed to O(n) as the other two algorithms. Experimental results on a collection of Mail and Web documents indicate that the agglomerative clustering algorithm performed the best, but also the slowest. The GQF algorithm performed well on the Mail domain, but performed poorly on the Web domain without tuning to its heuristic. The TreeCluster algorithm performed reasonably well on both domains, and was also the fastest algorithm out of the three algorithms tested.
منابع مشابه
A Dynamic Programming Approach To Document Clustering Based On Term Sequence Alignment
Document clustering is unsupervised machine learning technique that, when provided with a large document corpus, automatically sub-divides it into meaningful smaller sub-collections called clusters. Currently, document clustering algorithms use sequence of words (terms) to compactly represent documents and define a similarity function based on the sequences. We believe that the word sequence is...
متن کاملFast and Intuitive Clustering of Web Documents
Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing the results of a retrieval 6]. A person browsing the clusters can discover patterns that would be overlooked in the traditional ranked-list presentation. In this context, a document c...
متن کاملFuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition
In this paper, utilization of clustering algorithms for data fusion in decision level is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined with each other by using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...
متن کاملFast and Intuitive Clustering of Web DocumentsOren
Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing the results of a retrieval 4]. A person browsing the clusters can discover patterns that would be overlooked in the traditional ranked-list presentation. In this context, a document c...
متن کاملBayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and Reducibility
Bayesian hierarchical clustering (BHC) is an agglomerative clustering method, where a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome an...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2004